Transfer Functions
2024-12-05
In the 1970s and early 1980s there was a great deal of concern about acid lakes and rivers in northern Europe
Driven mainly by losses of salmon in Scandinavian rivers, this was a major political hot potato
A vast amount of money was expended to determine the cause of the acidification — was it due to acid emissions from power stations or some other cause?
Palaeolimnological data provided conclusive proof that acid deposition was the cause
In Europe, the Surface Waters Acidification Programme (SWAP) was a major contributor to the debate
Diatoms and associated water chemistry were collected from 167 lakes across the UK, Norway, and Sweden
Can we predict lake-water pH from the diatom species assemblages?
Apply to diatoms counted from a sediment core from the Round Loch of Glenhead (RLGH) covering most of the Holocene
Sea surface temperatures are related to global air temperatures
An important arm of palaeoceanography involves reconstructing past climates from various proxies
These past climates tell us how the world responded to previous climatic shifts and provide targets for climate modellers to test their models against
The data set here is the Imbrie & Kipp data set — the data set that started it all! 61 core-top samples from ocean cores, mainly from the Atlantic
27 species of planktonic foraminifera were identified in the core-top samples
Summer and Winter sea surface temperatures (SST) and sea water salinity values measured at each of the 61 core locations
Applied to reconstruct SST and salinity for 110 samples from Core V12-122 from the Caribbean
Aim: Predict the environment from observations on species & environment
Transfer functions
Calibration
Bioindication
Inverse of constrained ordination
ter Braak (1995) Chemometrics and Intelligent Laboratory Systems 28: 165–180
Matrix of species abundances \(\mathbf{Y}\)
Vector of observations of an environmental variable \(\mathbf{x}\)
Assume \(\mathbf{Y}\) is some function \(f\) of the environment plus an error term
\[ \mathbf{Y} = f(\mathbf{x}) + \varepsilon \]
In the classical approach \(f\) is estimated via regression of \(\mathbf{Y}\) on \(\mathbf{x}\)
Then invert \(f\), \(f^{-1}\), to yield an estimate of environment \(\mathbf{x_0}\) from fossil species assemblage \(\mathbf{y_0}\)
\[ \mathbf{\hat{x}_0} = f^{-1}(\mathbf{y_0}) \]
In all but the simplest cases \(f^{-1}\) does not exist in closed form and must be estimated via optimisation
To avoid the problems of inverting \(f\), the indirect approach directly estimates the inverse of \(f\), here \(g\), from the data by regressing \(\mathbf{x}\) on \(\mathbf{Y}\)
\[ \mathbf{x} = g(\mathbf{Y}) + \varepsilon \]
We do not believe that the species influence their environment!
This is just a trick to avoid having to estimate \(f\)
The predicted environment for a fossil sample \(\mathbf{y_0}\) is
\[\mathbf{\hat{x}_0} = g(\mathbf{y_0})\]
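The contrast between the two approaches can be seen in a one-"species" toy example. This is a minimal Python sketch with invented numbers and a linear response, not the multi-species methods the slides use later:

```python
# Toy contrast of classical vs inverse calibration for one "species"
# with a hypothetical linear response to the environment

def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / \
        sum((xi - mx) ** 2 for xi in xs)
    return my - b * mx, b

env = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]      # e.g. lake-water pH (invented)
abund = [2.1, 3.0, 3.9, 5.1, 6.0, 6.9]    # species abundance (invented)
y0 = 5.5                                  # abundance in a fossil sample

# Classical: regress Y on x, then invert y0 = a + b*x0  =>  x0 = (y0 - a)/b
a, b = fit_line(env, abund)
x0_classical = (y0 - a) / b

# Inverse: regress x on Y directly, then plug y0 straight in
c, d = fit_line(abund, env)
x0_inverse = c + d * y0
```

With near-linear data the two estimates almost coincide; they diverge as the species response becomes noisier or non-linear.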
Taxa in the training set are systematically related to the environment in which they live
The environmental variable to be reconstructed is, or is linearly related to, an ecologically important variable in the ecosystem
Taxa in the training set are the same as in the fossil data and their ecological responses have not changed significantly over the timespan represented by the fossil assemblages
Mathematical methods used in regression and calibration adequately model the biological responses to the environment
Other environmental variables have negligible influence, or their joint distribution with the environmental variable of interest is the same as in the training set
In model evaluation by cross-validation, the test data are independent of the training data — the secret assumption until Telford & Birks (2005)
There are quite a few ways of estimating \(f\)
A large number of potential techniques from machine learning and bioinformatics have yet to be investigated
Species don’t respond in simple ways to environmental gradients
The maximum likelihood method fits Gaussian curves to each species; numerical optimisation is then used to predict for fossil samples
Computationally very intensive, especially when doing cross-validation
Weighted averaging is an approximation to this maximum likelihood approach
A very simple idea
In a lake with a certain pH, species with pH optima close to the pH of the lake will tend to be the most abundant species present
A simple estimate of a species' pH optimum is the average of the pH values of the lakes in which it occurs, weighted by its abundance in each lake
An estimate of a lake’s pH is the weighted average of the pH optima of all the species present, again weighted by species abundance
By taking averages twice, the range of predicted values is smaller than the observed range
Deshrinking regressions stretch the weighted averages back out to the observed range
Inverse and classical deshrinking regressions; monotonic deshrinking is like inverse but uses a monotonic spline
Inverse and classical regression remove both bias and error; equalising variances deshrinks without adjusting the bias
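The two averaging steps plus inverse deshrinking can be sketched in a few lines. A minimal Python illustration with invented abundances and pH values (real training sets such as SWAP are far larger):

```python
# Sketch of two-way weighted averaging (WA) with inverse deshrinking;
# all data invented for illustration

def wa_optima(Y, x):
    """WA optimum for each species: abundance-weighted mean of x."""
    optima = []
    for j in range(len(Y[0])):
        w = [row[j] for row in Y]
        optima.append(sum(wi * xi for wi, xi in zip(w, x)) / sum(w))
    return optima

def wa_estimate(sample, optima):
    """Raw WA estimate for one sample: optima weighted by abundance."""
    return sum(a * o for a, o in zip(sample, optima)) / sum(sample)

# rows = lakes, columns = species; x = observed lake-water pH
Y = [[10, 5, 0], [5, 10, 2], [0, 5, 10], [2, 8, 10]]
x = [4.5, 5.0, 6.0, 5.8]

optima = wa_optima(Y, x)
raw = [wa_estimate(row, optima) for row in Y]

# Averaging twice shrinks the range, so stretch the raw estimates back
# with an inverse deshrinking regression of observed x on them
n = len(x)
mr, mx = sum(raw) / n, sum(x) / n
b1 = sum((r - mr) * (xi - mx) for r, xi in zip(raw, x)) / \
     sum((r - mr) ** 2 for r in raw)
b0 = mx - b1 * mr
deshrunk = [b0 + b1 * r for r in raw]
```

The raw estimates span a narrower range than the observed pH values, which is exactly what the deshrinking step corrects.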
analogue contains R code for fitting WA transfer functions and associated helper functions
Weighted Averaging Transfer Function
Call:
wa(formula = SumSST ~ ., data = ik, deshrink = "inverse")
Deshrinking : Inverse
Tolerance DW : No
No. samples : 61
No. species : 27
Performance:
RMSE R-squared Avg. Bias Max. Bias
2.0188 0.9173 0.0000 -3.8155
Weighted Averaging Predictions
Call:
predict(object = wa_m, newdata = V12.122)
Deshrinking : Inverse
Crossvalidation : none
Tolerance DW : No
Performance:
RMSEP R2 Avg.Bias Max.Bias
2.0188 0.9173 0.0000 -3.8155
Predictions:
0 10 20 30 40 50 60 70 80 90
26.8321 26.7870 26.5611 26.1722 26.1857 26.1670 25.9064 26.0574 26.2797 25.6723
100 110 120 130 140 150 160 170 180 190
26.1054 25.6092 25.8379 25.7696 25.7891 26.0105 25.8400 26.1986 26.0054 26.4729
200 210 220 230 240 250 260 270 280 290
26.4282 26.5318 26.7689 26.7812 26.8077 26.0786 26.4078 26.3981 26.1494 26.4148
300 310 320 330 340 350 360 370 380 390
26.2799 25.8553 26.0269 25.3974 26.0271 26.2423 26.3020 26.7047 26.7140 26.2727
400 410 420 430 440 450 460 470 480 490
25.4927 26.7538 26.6039 26.6019 26.1936 26.7939 26.7742 26.2152 25.4620 26.7682
500 510 520 530 540 550 560 570 580 590
26.8107 26.2679 25.7851 25.8562 25.5992 25.0000 25.3488 25.3794 25.3995 26.5347
600 610 620 630 640 650 660 670 680 690
26.1509 26.1765 26.1447 25.8472 26.3835 26.3507 26.0932 24.5383 25.3052 26.6331
700 710 720 730 740 750 760 770 780 790
26.3173 26.4848 26.0882 26.1193 26.1579 26.0043 26.3400 26.6920 26.9768 26.9926
800 810 820 830 840 850 860 870 880 890
26.8074 26.4448 25.4736 25.8549 26.0450 26.2881 25.6021 26.1688 25.8223 24.1910
900 910 920 930 940 950 960 970 980 990
24.4447 24.9817 25.4642 26.2359 26.4497 26.2772 26.1387 26.1874 25.8485 25.7372
1000 1010 1020 1030 1040 1050 1060 1070 1080 1090
25.8538 24.8725 24.1065 24.4843 24.1864 25.6200 25.1869 24.8619 26.0186 25.6395
We might expect those taxa with a narrow (realised) niche to be better indicators of \(\mathbf{x}\) than those taxa with a wide niche
Hence development of tolerance down-weighting in WA
Basically, weight taxa inversely in proportion to their estimated tolerance width
Sounds simple, but isn’t — could give infinite weight to a taxon found in a single training set sample!
If done correctly it produces very competitive models
Taxa with very small tolerances get their tolerance replaced by a set minimum value
Also use Hill's N2 when computing tolerance widths to ensure they are unbiased
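The quantities involved can be sketched as follows — a Python illustration (invented numbers) of a tolerance as the abundance-weighted SD of the environment about the optimum, and of Hill's N2; analogue's actual implementation additionally applies the N2-based correction and a floor for tiny tolerances:

```python
import math

def wa_optimum(w, x):
    """Abundance-weighted mean of the environment for one species."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def tolerance(w, x):
    """Naive tolerance: abundance-weighted SD of x about the optimum."""
    opt = wa_optimum(w, x)
    return math.sqrt(sum(wi * (xi - opt) ** 2
                         for wi, xi in zip(w, x)) / sum(w))

def hills_n2(w):
    """Hill's N2: effective number of occurrences of a species."""
    tot = sum(w)
    return 1.0 / sum((wi / tot) ** 2 for wi in w)

w = [1.0, 4.0, 4.0, 1.0]   # abundances of one species across lakes
x = [4.5, 5.0, 5.5, 6.0]   # lake-water pH
tol = tolerance(w, x)
n2 = hills_n2(w)
```

A species occurring effectively in only one sample has N2 near 1 and a tolerance near zero, which is why a floor on small tolerances is needed before down-weighting.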
WA takes a species-level approach to reconstruction — each species in the fossil sample that is also in the training set contributes to the reconstructed values
MAT takes a more holistic approach — we predict on the basis of similar assemblages
In MAT, only the most similar assemblages contribute to the fitted values
MAT is steeped in the tradition of uniformitarianism — the present is the key to the past
We take as our prediction of the environment of the past, the (possibly weighted) average of the environment of the \(k\) sites with the most similar assemblages
Several things to define: \(k\) and the (dis)similarity measure
MAT is \(k\) nearest neighbours (\(k\)-NN) regression/calibration
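As a sketch of MAT as \(k\)-NN, here is a minimal Python example using squared chord distance (a common choice for proportion data); the assemblages and SST values are invented, not the Imbrie & Kipp data:

```python
import math

def sq_chord(a, b):
    """Squared chord distance between two assemblages (proportions)."""
    return sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(a, b))

def mat_predict(train_Y, train_x, fossil, k):
    """Mean environment of the k training samples closest to `fossil`."""
    order = sorted(range(len(train_Y)),
                   key=lambda i: sq_chord(train_Y[i], fossil))
    return sum(train_x[i] for i in order[:k]) / k

train_Y = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1],
           [0.1, 0.3, 0.6], [0.0, 0.2, 0.8]]
train_x = [14.0, 16.0, 22.0, 25.0]   # e.g. summer SST (invented)
pred = mat_predict(train_Y, train_x, [0.1, 0.2, 0.7], k=2)
```

A weighted variant would weight each neighbour's environment by inverse dissimilarity rather than taking a plain mean.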
If you want to fit MAT models, my J. Stat. Soft. paper has fully-working examples to follow
Simpson, G.L., 2007. Analogue Methods in Palaeoecology: Using the analogue Package. J. Stat. Softw. 22, 1–29.
Or use rioja::MAT(), especially if you want h-block CV
analogue is getting a little out-dated now — it needs some love
rioja is very similar, provides more functionality (WA-PLS, maximum likelihood TFs, …) but most importantly, h-block cross-validation!
analogue has better support for the Modern Analogue Technique
Method : Weighted Averaging
Call : WA(y = ik, x = SumSST)
Tolerance DW : No
Monotonic deshrink : No
No. samples : 61
No. species : 27
Cross val. : none
Deshrinking regression coefficients:
Inverse d/s Classical d/s
wa.b0 -5.6876 5.8905
wa.b1 1.2659 0.7246
Performance:
RMSE R2 Avg.Bias Max.Bias Skill
WA.inv 2.0188 0.9173 0 3.8155 91.7299
WA.cla 2.1078 0.9173 0 3.1324 90.9842
Principal Component Regression — decompose \(\mathbf{Y}\) into PCA axes and use the first \(l\) of these axes to predict \(\mathbf{x}\): analogue::pcr()
Partial Least Squares is similar — find orthogonal components in \(\mathbf{Y}\) that are most correlated with \(\mathbf{x}\), and use the first \(l\) to predict \(\mathbf{x}\)
WA-PLS is a non-linear version of PLS — use row & column sums to turn a linear method into a non-linear one, just like CCA does: rioja::WAPLS()
An important aspect of fitting these models is to choose \(l\) — the number of components to retain
Bias is the tendency of the model to over- or under-predict
Average bias is the mean of the residuals
Maximum bias is found by breaking the range of the measured environment into \(n\) contiguous chunks (\(n = 10\) usually)
Within each chunk calculate the mean of the residuals for that chunk
Take the maximum value of these as the maximum bias statistic
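These steps translate directly into code. A minimal Python sketch with invented observed/predicted values (two chunks rather than the usual ten, so the toy data fill every chunk):

```python
def bias_stats(obs, pred, n_chunks=10):
    """Average bias and (signed) maximum bias of pred vs obs."""
    resid = [p - o for o, p in zip(obs, pred)]
    avg_bias = sum(resid) / len(resid)
    # break the observed range into contiguous chunks and take the
    # mean residual within each; max bias is the largest in magnitude
    lo, hi = min(obs), max(obs)
    width = (hi - lo) / n_chunks
    chunk_means = []
    for c in range(n_chunks):
        in_chunk = [r for o, r in zip(obs, resid)
                    if lo + c * width <= o < lo + (c + 1) * width
                    or (c == n_chunks - 1 and o == hi)]
        if in_chunk:
            chunk_means.append(sum(in_chunk) / len(in_chunk))
    return avg_bias, max(chunk_means, key=abs)

obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.2, 2.8, 3.6]
avg_b, max_b = bias_stats(obs, pred, n_chunks=2)
```

Keeping the sign of the maximum bias matches the package output shown elsewhere in these slides, where Max. Bias can be negative.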
Without cross-validation, prediction errors, measured by RMSEP, will be biased, often badly so, because we use the same data to both fit & test the model
Ideally we’d have such a large training set that we can split this into a slightly smaller training set and a small test set
Palaeoecological data is expensive to obtain — in money and person-hours!
Also these ecosystems are complex, species rich, noisy etc., so we want to use all our data to produce a model
One solution to this problem is to use cross-validation
General idea: perturb the training set in some way, build a new model on the perturbed training set and assess how well it performs
If we repeat the perturbation several times we get an idea of the error in the model
Several techniques; n-fold, leave-one-out (LOO), bootstrapping, h-block CV
In analogue, several methods are available
For MAT models, LOO is built into the procedure so only bootstrapping is available
For WA models, both LOO and bootstrapping currently available
\(n\)-fold CV & h-block will be available in a future version (!)
LOO CV is very simple
In turn, leave out each sample from the training set
Build a model on the remaining samples
Predict for the left out sample
Calculate the RMSEP of these predictions
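The LOO loop can be sketched generically in Python; here a least-squares line stands in for the transfer function (a simplification — in practice WA or MAT would be refitted at each step in the same way):

```python
import math

def fit_ols(xs, ys):
    """Ordinary least squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def loo_rmsep(x, y):
    """Leave each sample out, refit, predict it, summarise as RMSEP."""
    resid = []
    for i in range(len(x)):
        a, b = fit_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        resid.append((a + b * x[i]) - y[i])
    return math.sqrt(sum(r * r for r in resid) / len(resid))

# perfectly linear data => every held-out prediction is exact
rmsep_exact = loo_rmsep([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
# noisy data => a positive RMSEP
rmsep_noisy = loo_rmsep([1.0, 2.0, 3.0, 4.0, 5.0],
                        [1.2, 1.9, 3.1, 4.2, 4.9])
```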
Leave one out sample 10
Leave one out sample 20
Leave one out sample 30
Leave one out sample 40
Leave one out sample 50
Leave one out sample 60
RMSE R2 Avg.Bias Max.Bias
2.019 0.917 0.000 -3.815
RMSEP R2 Avg.Bias Max.Bias
2.218 0.900 -0.014 -4.599
Method : Weighted Averaging
Call : WA(y = ik, x = SumSST)
Tolerance DW : No
Monotonic deshrink : No
No. samples : 61
No. species : 27
Cross val. : loo
Deshrinking regression coefficients:
Inverse d/s Classical d/s
wa.b0 -5.6876 5.8905
wa.b1 1.2659 0.7246
Performance:
RMSE R2 Avg.Bias Max.Bias Skill
WA.inv 2.0188 0.9173 0.0000 3.8155 91.7299
WA.cla 2.1078 0.9173 0.0000 3.1324 90.9842
WA.inv_XVal 2.2179 0.9003 -0.0137 4.5985 90.0176
WA.cla_XVal 2.3073 0.9018 -0.0067 3.9326 89.1969
n-fold CV or leave-group-out CV
Same as LOO but instead of leaving out one sample at a time, we leave out an entire group of samples
n is usually 10
h-block CV is very similar to LOO & \(n\)-fold
Instead of leaving out just the target sample, we leave out all observations within distance h of the target sample
Repeat for each sample in turn (like LOO)
Useful when training data are autocorrelated — as they often are in marine settings
Use rioja::crossval() to do h-block CV
Bootstrapping is used in machine learning to improve predictions
Use bootstrapping to get more realistic RMSEP and bias statistics
We draw a bootstrap sample (sampling with replacement) of the same size as our training set
Build a model on the bootstrap samples
Predict for the out-of-bag (OOB) samples
The bootstrap prediction for each training sample is the mean of its OOB predictions
Calculate the residuals and then the RMSEP
\[\mathrm{RMSEP_{boot}} = \sqrt{s_1^2 + s_2^2}\]
\(s_1\) is the standard deviation of the OOB residuals
\(s_2\) is the mean of the OOB residuals (the bias)
We can also calculate the more usual RMSEP \(\sqrt{\sum_{i=1}^n (y_i - \hat{y}_i)^2 / n}\)
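A simplified Python sketch of the bootstrap RMSEP calculation. A trivial mean predictor stands in for the transfer function, and all OOB residuals are pooled rather than first averaged per sample, so this illustrates the \(\sqrt{s_1^2 + s_2^2}\) arithmetic, not the package's exact procedure:

```python
import math
import random

def boot_rmsep(x, n_boot=200, seed=1):
    """Bootstrap RMSEP: sqrt(s1^2 + s2^2) from pooled OOB residuals."""
    rng = random.Random(seed)
    n = len(x)
    oob_resid = []
    for _ in range(n_boot):
        # draw a bootstrap sample of the same size, with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        model_mean = sum(x[i] for i in idx) / n   # "fit" on that sample
        inbag = set(idx)
        for i in range(n):
            if i not in inbag:                    # predict OOB samples
                oob_resid.append(model_mean - x[i])
    s2 = sum(oob_resid) / len(oob_resid)          # mean OOB residual (bias)
    s1 = math.sqrt(sum((r - s2) ** 2 for r in oob_resid)
                   / (len(oob_resid) - 1))        # SD of OOB residuals
    return math.sqrt(s1 ** 2 + s2 ** 2)

rmsep_boot = boot_rmsep([1.0, 2.0, 3.0, 4.0, 5.0])
```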
Weighted Averaging Bootstrap Predictions
Call:
bootstrap(object = wa_m, n.boot = 999, verbose = FALSE)
Deshrinking : Inverse
Crossvalidation : bootstrap
Tolerance DW : No
Performance:
RMSEP R2 Avg.Bias Max.Bias
2.2919 0.9012 -0.0298 -4.6312
Training set predictions:
V14.61 V17.196 V18.110 V16.227 V14.47 V23.22 V2.12 V23.29
4.1761 3.8285 4.0208 3.9927 8.4033 9.2821 3.0228 14.6312
V12.43 R9.7 A157.3 V23.81 V23.82 V12.53 V23.83 V12.56
14.5555 17.0486 15.9813 19.0779 18.5811 18.7827 17.5376 20.5445
A152.84 V16.50 V22.122 V16.41 V4.32 V12.66 V19.245 V4.8
19.9792 19.8121 18.7319 22.9201 22.4598 20.7307 22.4718 22.0932
A180.15 V18.34 V20.213 V19.222 A180.39 V16.189 V12.18 V7.67
21.4827 23.3740 23.3249 22.8605 24.2542 25.6955 25.5307 23.2759
V17.165 V19.310 V16.190 A153.154 V19.308 V22.172 V10.98 V22.219
23.6758 23.0183 24.5248 25.3909 25.8264 26.3325 24.0536 25.4655
V16.33 V22.204 V20.167 V10.89 V12.79 V19.216 V14.90 A180.72
26.3604 25.8272 26.7432 26.4347 26.0909 25.7002 25.8305 26.3170
V16.21 A180.76 V15.164 A180.78 V14.5 V3.128 A179.13 V9.31
26.7791 26.6893 26.8161 25.9479 26.8964 26.7957 26.3988 26.0191
V20.230 V20.7 V20.234 V18.21 V12.122
26.6089 27.2027 26.7317 26.9673 26.8097
With PCR, PLS, WA-PLS, and MAT we face an extra challenge
How to tune the model’s complexity parameter: \(l\) or \(k\)
How many components should we use? How many analogues?
Technically, we will underestimate RMSEP if we use CV to tune the model
We need a tuning test set
Form a test set from the training set (representative, evenly spaced) — set aside
Form a tuning or optimisation set from the training set (as above) — set aside
Build your TF using the remaining samples
You could also use a nested CV design — nothing readily available for analogue or rioja models though we should look at this!
A measure of reliability for the reconstructed values can be determined from the distance between each fossil sample and the training set samples
For a reconstructed value to be viewed as more reliable, it should have at least one close modern analogue in the training set
Close modern analogues are defined as those modern training set samples that are as similar to a fossil sample as a low percentile of the observed distribution of dissimilarities in the training set, say the 5\(^{th}\) percentile
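The screening can be sketched as: compute all pairwise training-set dissimilarities, take a low percentile as the cut-off, and check each fossil sample's nearest training neighbour against it. A minimal Python illustration with invented assemblages:

```python
import math

def sq_chord(a, b):
    """Squared chord distance between two assemblages (proportions)."""
    return sum((math.sqrt(p) - math.sqrt(q)) ** 2 for p, q in zip(a, b))

def analogue_cutoff(train_Y, prob=0.05):
    """Low percentile of all pairwise training-set dissimilarities."""
    d = sorted(sq_chord(train_Y[i], train_Y[j])
               for i in range(len(train_Y))
               for j in range(i + 1, len(train_Y)))
    return d[int(prob * (len(d) - 1))]   # simple empirical quantile

def has_close_analogue(fossil, train_Y, cutoff):
    """Is any training sample within the cut-off of the fossil sample?"""
    return min(sq_chord(fossil, t) for t in train_Y) <= cutoff

train_Y = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1],
           [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]]
cutoff = analogue_cutoff(train_Y)
ok = has_close_analogue([0.55, 0.35, 0.10], train_Y, cutoff)
```

Fossil samples failing this check still get a reconstructed value — it is simply flagged as less reliable.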
Which samples in the core have a close match in the training set?
We can use the bootstrap approach to generate sample specific errors for each fossil sample
\[\mathrm{RMSEP} = \sqrt{s^2_{1_{fossil}} + s^2_{2_{model}}}\]
\(s_{1_{fossil}}\) is the standard deviation of the bootstrap estimates for each fossil sample
\(s_{2_{model}}\) is the average bias, the mean of the bootstrap OOB residuals from the model
Juggins (Quaternary Science Reviews, 2013)
Are transfer functions too good to be true?
\(\mathbf{x}\) must have biological/physiological control on the taxa
Secondary nuisance variables are (very) problematic — they can bias the estimated species–environment relationships
You could be reconstructing variation in something else!
Variation downcore that has nothing to do with \(\mathbf{x}\) will show up as variation in \(\mathbf{x}_0\)
Work through the SWAP example